Fast codepointOffset #451

Draft · wants to merge 16 commits into base: master

axman6 commented Jul 1, 2022

Implements codepointOffset with code from the FreeBSD project.

I'm planning to explore a vectorised implementation of the search for codepoints encoded as 2, 3 and 4 bytes, but will leave that out of the first iteration.

This may be relevant to #369, by eliminating the need to decode codepoints via Haskell.

axman6 (Author) commented Jul 1, 2022

I'm not sure why older GHCs are unable to infer the types for the tests I've added, since the types should all be trivially known (Text and Char).

axman6 marked this pull request as ready for review July 1, 2022 12:32

Bodigrim (Contributor) commented Jul 1, 2022

Thanks @axman6! I suggest we start with splitOnChar / breakOnChar in a separate PR: first a naive implementation with tests and benchmarks, then make it fast with whatever it takes. Tackling both splitOnChar and memmem in one go feels a bit overwhelming.

axman6 (Author) commented Jul 2, 2022

Yeah, I've been working on rewriting the C to avoid going via memmem. Removing twoway_memmem would significantly reduce the amount of code to maintain. I would guess there are faster memmem implementations out there, hopefully under permissive licenses too. I'll get the changes working and push them today.

Bodigrim (Contributor) commented Jul 2, 2022

I have a suspicion that breakOnChar / splitOnChar does not require any additional C code at all. It might be enough to memchr for the last (least significant) byte of the character's UTF-8 encoding and then check manually that the preceding bytes match.
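As a rough illustration of the idea above (not code from this PR), the search could look like the following C sketch. The names `utf8_encode` and `find_char_utf8` are hypothetical, and the sketch assumes the buffer is already valid UTF-8, which is what makes a full byte-for-byte match land on a codepoint boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: encode a code point as UTF-8 into out[4] and
   return the number of bytes written. Surrogates are not rejected. */
static size_t utf8_encode(unsigned int cp, unsigned char out[4]) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical search: byte offset of the first occurrence of cp in a
   valid UTF-8 buffer, or -1 if it does not occur. */
static ptrdiff_t find_char_utf8(const unsigned char *buf, size_t len,
                                unsigned int cp) {
    unsigned char enc[4];
    size_t n = utf8_encode(cp, enc);
    size_t pos = 0;

    while (pos < len) {
        /* Let memchr locate a candidate for the last byte of the encoding. */
        const unsigned char *hit = memchr(buf + pos, enc[n - 1], len - pos);
        if (hit == NULL) {
            return -1;
        }
        size_t end = (size_t)(hit - buf);
        /* Check by hand that the n - 1 preceding bytes match as well. */
        if (end + 1 >= n && memcmp(buf + end + 1 - n, enc, n - 1) == 0) {
            return (ptrdiff_t)(end + 1 - n);
        }
        pos = end + 1; /* false positive: resume after this byte */
    }
    return -1;
}
```

For a single-byte (ASCII) character this degenerates to a plain memchr. For multi-byte characters the cost depends on how often the last byte occurs as part of other characters, which only benchmarks can settle.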

Anyways, let's separate concerns. From my perspective the first task is to add breakOnChar / splitOnChar with naive, pure Haskell implementation. Once it is done and merged, we can discuss optimizations in a separate PR.

axman6 (Author) commented Jul 16, 2022

I'll try to find some time to write a Haskell-only version, and then we can think about making a faster C one later. I wonder if it's worth having both, and only moving to the C call when there's enough data to justify it.
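The "have both" idea is essentially a size-based dispatch. The C sketch below shows only the shape of that decision for a single-byte search; in the setup discussed here the simple path would be the Haskell implementation and the other path the C call. `SMALL_INPUT_THRESHOLD` and all function names are hypothetical, and the cutoff of 16 is a placeholder rather than a measured value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff; a real value would have to come from benchmarks. */
#define SMALL_INPUT_THRESHOLD 16

/* Simple path: a plain loop with no call overhead, fine for short inputs. */
static ptrdiff_t find_byte_simple(const unsigned char *buf, size_t len,
                                  unsigned char b) {
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == b) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}

/* Optimised path: delegate to the C library's memchr. */
static ptrdiff_t find_byte_memchr(const unsigned char *buf, size_t len,
                                  unsigned char b) {
    const unsigned char *hit = memchr(buf, b, len);
    return hit != NULL ? (ptrdiff_t)(hit - buf) : -1;
}

/* Dispatch on input size: short inputs stay on the simple path. */
static ptrdiff_t find_byte(const unsigned char *buf, size_t len,
                           unsigned char b) {
    assert(buf != NULL || len == 0);
    if (len < SMALL_INPUT_THRESHOLD) {
        return find_byte_simple(buf, len, b);
    }
    return find_byte_memchr(buf, len, b);
}
```

Both paths must return identical results, so the same tests can cover them by using inputs on either side of the cutoff.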

Bodigrim marked this pull request as draft April 11, 2024 19:57